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Abstract —Using a proper model to characterize a time series 
is crucial in making accurate predictions. In this work we use 
time-varying autoregressive process (TVAR) to describe non¬ 
stationary time series and model it as a mixture of multiple 
stable autoregressive (AR) processes. We introduce a new 
model selection technique based on Gap statistics to learn 
the appropriate number of AR filters needed to model a time 
series. We define a new distance measure between stable AR 
filters and draw a reference curve that is used to measure how 
much adding a new AR filter improves the performance of 
the model, and then choose the number of AR filters that has 
the maximum gap with the reference curve. To that end, we 
propose a new method in order to generate uniform random 
stable AR filters in root domain. Numerical results are provided 
demonstrating the performance of the proposed approach. 

Keywords-G&p statistics; stable autoregressive filters; time- 
varying autoregressive process; uniform distribution. 

I. Introduction 

Modeling time-series has been of a great interest for a 
long time. A good model not only describes the observed 
data well, but also avoids over-fitting, which can reduce the 
predictive power of the model. Autoregressive (AR) models 
are one of the most commonly used techniques to model 
stationary time-series Ul. A time-varying autoregressive 
(TVAR) model is a generalized form of an AR model that is 
used to describe the non-stationarity of time-series El. An 
example of TVAR models is the regime-switching model 
m, which assumes that a non-stationary stochastic process 
is composed of different epochs/regimes each of which is a 
stationary process, and that the regimes switch according to 
a Markov process. Another example is the model proposed 
in 0, which uses a mixture of Gaussian AR models to 
describe time-series and uses an expectation-maximization 
(EM) algorithm to determine the parameters of the model. 
Wong and Li 0 used Akaike information criterion (AIC) 
a and Bayes information criterion (BIC) 0 to introduce 
a penalty on the complexity of the model and estimate the 
number of AR filters. In general AIC and BIC are shown to 
be suboptimal for determining the number of modes 0. 

Tibshirani et al. 0 introduced an intuitive technique for 
determining the appropriate number of clusters using Gap 


statistics. The general idea of the Gap statistics is to identify 
the number of clusters in a data set by comparing the 
goodness of fit for the observed data with its expected value 
under a reference distribution. In this work we extend the 
Gap statistics to time-series in order to identify the number 
of AR filters needed to describe a time-series. We use a 
reference curve to measure how much adding a new AR 
filter improves the model under reference distribution, and 
then choose the number of filters that has the maximum gap 
with that reference curve. 

In order to derive a reference curve, it is important to 
first clarify the meaning of the “goodness of fit”. In 0, 
the “goodness of fit” is measured by the sum of squared 
Euclidean distances of each data point from the center of 
the cluster it belongs to. But a different measure needs to 
be used in time-series to evaluate the performance of each 
model. In this work, our goal is to select models that have 
higher predictive powers, and thus we define the “goodness 
of fit” measure to be the mean squared prediction error 
(MSPE). We use MSPE to define a new distance measure 
between two stable filters in accordance with our need for 
a reference curve in Gap statistics. Our proposed distance 
measure differs from the previous distances (e.g., cepstral 
distance 0, discrete Eourier transform (DPT) 0, principal 
component analysis (PCA) Eol, and discrete wavelet trans¬ 
form (DWT) ifTtlH in that it naturally arises from the MSPE 
of the one-step prediction. 

Computing each point on the reference curve using the 
new distance measure turns out to be a clustering problem 
in the space of stable AR filters with a fixed size, which 
is solved using the Monte Carlo approach. To that end, we 
introduce an approach to generate uniform random stable 
filters with equal sizes, and apply the fc-medoids algorithm to 
approximate the optimal solution for the clustering problem. 
Numerical simulations show that the accuracy of the pro¬ 
posed Gap statistics in estimating the number of AR filters 
surpasses that of AIC and BIC. 

The remainder of the paper is organized as follows. 
Section |II] describes the model considered in this paper. In 
Section [HI] we propose the Gap statistics to estimate the 


number of AR filters in a time-series. Section |IV] presents 
some numerical results to evaluate the performance of the 
proposed approach. We make our conclusions in Section [V] 

II. Model Assumption and Its Estimation 

In order to model a given time series X = {xn}n=i’ we 
assume that each data point depends only on L previous 
points and that L is known in this paper. We use a time- 
varying autoregressive (TVAR) model to describe the value 
at time step n as follows 

L 

Xn ~ ^ ^ (1) 

where are real numbers, and £„ ^ A/'(0, cr^) are 

independent Gaussian noise. Assume that the first L points 
of a sample set, xq, • • • , Xi_i, are known. The vector form 
of this equation can be written as 

Xn — (2) 

where (pji — * * * ? 5 and Xn —2 

• • • , Xu-lV- In real scenarios, </>„ is a time-varying vector, 
and modeling the variations of </>„ can be complicated. For 
simplicity, we assume that </>„ G { 71 ,...,7m}, for n = 
1, 2, • • • ,N, where M is the number of AR filters used to 
describe X, and that each 7 ^ is a filter with size L. We 
refer to 7 ^ as mode m and call this model multi-mode AR 
model. 

Clearly, M can be seen as a parameter for a nested family 
of models, and larger M will fit the observed data better. 
But as mentioned before, the predictive power of the model 
drops if M is too large. Hence, a model selection procedure 
that identifies the appropriate number of modes is important. 
To that end, we evaluate MSPE for a range of values of M 
which is assumed to contain the number of modes. We first 
estimate the parameters of each of the candidate models, and 
then select the number of AR modes according to the Gap 
statistics developed in Section |III] For simplicity, we further 
assume the model with the following log-likelihood; 

N 

logp(X|0) = ^logp(x„ I Xn) 

n—1 

N / M \ 

= X! [ X! ^rnM{Xr, I 7™^;™, cr^) j (3) 

n—1 \m—1 / 

where = 1, <Xm > 0 for any fixed M, and J\f{x \ 

) denotes the density of Gaussian distribution of mean 
/j, and variance evaluated at x, i.e., J\f{x \ /i, a^) = 
(27r)“^/^tT“^ exp{—(x — p)^/(2ct^)}. 

Let 0 = {am,7m, = 1, • • • , Mj be the set of un¬ 

known parameters to be estimated. Next, we briefly describe 
how to approximate the maximum-likelihood estimation 
(MLE) of 0. Though computing the MLE in Q is not 
tractable, it can be approximated by a local maximum via the 


EM algorithm ifT^ . Let Z = {zn}n^i be the membership 
labels with = [zni,z„ 2 , ■■■ , ZnMr, where 


^nm 


1 if (/)„ = 7m 
0 otherwise 


The joint probability of X and Z can be written as a product 
of conditional probabilities as 

N M 

p(X, Z I 0) = {amPiXn I ■ 

n—1 m—1 

Thus the complete log-likelihood is 

N M 

logp(X,Z|0) = EE ^nm log (^OtmP{Xn I Xn)) ■ (4) 

n—1 m—1 

The EM algorithm produces a sequence of estimates by 
recursive application of E-step and M-step until some con¬ 
vergence criterion is achieved. For brevity, we provide the 
EM formulas below without detailed derivations. We note 
that the M-step uses a coordinate ascent algorithm to find a 
local maxima. 

E-Step: We take the expectation of (|4]i with respect to 
the missing data Z given the recent estimated unknown 
parameters, and obtain the following function (also refeiTed 
to as the “Q function”) 


N M 

Q(0 I X, Z, 0°“) = X! E (aniP(Xn I Xn)) , 

n=l m—1 


where 

a„iJ\f{xn I 7 ma:„,cr 2 ) 

Wnm — 

^ am'A/'(Xn I 7^,a:n,tT2) 

m' — l 

We note that the parameters involved in the right-hand side 
of the equation above take values from the last update. The 
“old” superscriptions are omitted for brevity. In other words, 
the E-step replaces the “missing data” Znm in Equation 
© by its expected values Wnm for n — 1, - ■ ■ , N, m — 
!,■■■ ,M. 

M-Step: Letting the derivatives of the Q function be zero 
leads to a coupled non-linear system that has no closed- 
form solution. Thus, our best hope is an approximation to 
the solution; for this we use the coordinate ascent algorithm 
to obtain a local maximum. For each am, we apply the 
Lagrange method with constraint = 1 to obtain 

N 

y} '^nm 

_ n—1 

N 

Wnm 

m—1 n—1 


N 

E w„ 

n—1 

w 


m = 1, • 


,M. 





Taking the partial derivative of the Q function with respect 
to 7 m and then we obtain the following local maximum 
\ -1 


Tm — 



nm^n^n 



nm^n^n 




If xpA is stable, the roots oi, • • • , ai, lie inside the unit circle 
(IOf I < 1). When we use if) a at time step n — 1 to predict 
the value at time n, i.e., in = if>'\Xn, the MSPE is 




E{{Xn - inf) = CT^ 


(9) 


M N 


V V '^nm {Xn 


2 m—1 n—1 

a = - 


N 


Suppose that we use a filter other than if)A to predict the 
value at time n. The mis-specified filter is denoted by ips = 
['ff’Bi, - ■ ■ Then the MSPE becomes 


in. Gap Statistics to Determine the Number of 
Modes 

In this work we use Gap statistics Q to estimate the num¬ 
ber of AR hlters in a time series. In this technique, a data set 
*8 = {yi,t/ 2 , • • • jUf) is clustered into M disjoint clusters 
Cl, C 2 , ■ ■ • ! Cm by minimizing the following within-cluster 
sum of distances (WCSD) 


Wm = min 

Ml 


M 

Y. Y 

m—1 yGCm 


,y) 


(5) 


where is the distance of y from yim, the center 

of the mth cluster. Each cluster Cm is dehned based on 
/ii,--- ,pm as 


Cm = {?/ : y G *8, d{yim,y) < d{yim',y) Vm' ^ m). (6) 


After computing Wm for M = 1, 2, • • • , M^ax where Mmax 
is assumed to be the largest possible number of clusters, the 
graph of log(WM) is standardized by comparing it to its 
expectation under a non-informative reference distribution. 
This can be chosen to be a uniform distribution. The point 
that has the largest difference with the reference curve is 
selected as the estimated number of clusters. 

Let fi(y, Hm) be the squared Euclidean distance (the most 
commonly used distance measure for clustering purposes). 
Then with this distance, Wm becomes within-cluster sum 
of squares. However, for clustering AR filters. Euclidean 
distances have been shown to be ineffective |[8l. Hence, we 
introduce a new distance measure in Sec. IIII- Al that is well- 
suited for AR clustering. 


A. Distance Measure for Autoregressive Processes 

In order to find a reference curve for Gap statistics, we 
derive the distance between two filters based on MSPE. We 
assume that the data is generated by a stable filter if )a = 
[il)Ai,il^A 2 , ■ ■ ■ .if^AhY with size L, i.e., 

Xn = ‘ilt\Xn + En, ^(0, (T^). (7) 


E{{Xn - inf) = e[{{iI) 1 - ifDXnf] + (10) 

Motivated by Equations (|9ll and (fTOl i. we define the distance 
between hlters if )a and ipB by 

D{lf)A,lf)B) = E ^^{{lf)'^-lf)l)Xnf'^ , (11) 


which can be calculated using the power spectral density of 

Xn as 


D {lf)A,lf>B) 


— / - -o— —diu. 

271 |1 - T'A(e-7“)|^ 

( 12 ) 


Using z = (j = \/^), we get 


D {lf)A,lf)B) 

_ f {'iiA{z)-AfB{z)){'SA{z~^)-'SB{z~^))dz 
27rj Jc (1 - 'i’A^z)) (1 - T'a( 2 -i)) z 

(13) 


Using Cauchy’s integral theorem. Equation (O can be 
written in terms of the roots oi, • • • , and 6i, • • • , as 


L n («fe - be) 

D {ipA, iPb) =o-‘^Y - 

'==1 ofc n (“fc “ 0-t) 

t=l 

e^k 


/ L 

0(1- akb\) 
t=\ 


\ 


L 

n(l-afca|) 


e=i 


- 1 


(14) 


for a/c 0,ak f ae, k £, where a* is the conjugate of a 
complex number a. Eor the degenerate cases when 0^=0 
or Ofe = ae, D{iPa,'4’b) reduces to lim DfipA^ipB) or 

afe->0 


lim 


DfipAjipB)- We note that the distance (fl4l i is not 


symmetric, i.e., D (iPa^^Pb) f- D{iPb,iPa)- The distance 
measure dehned in (fT4l i is proportional to cr^, which results 
in a constant log in the computation of log Wm for the 
reference curve. Since it is the same for different M, without 
loss of generality we can set cr^ = 1. 


Eor now we assume that Xn has zero mean. Let a{z) = 

L 

^ ipAez~^ be the characteristic polynomial of ipA, and 

e=i 

ai, • • • , ol denote the roots of 1 — T'a(z), i.e., 

L 

l--^A{z) = Y[{l-aez-f. ( 8 ) 

e=i 


B. Generation of Uniformly Distributed Random Filters 
As mentioned before. Gap statistics require a reference 
curve that is generated by clustering sampled data from a 
reference distribution, which is usually chosen to be uni¬ 
form. Therefore, uniform random generation of stable hlters 
(uniform in coefficient domain) is needed. Since the roots 
are used in Equation (fl4li . we use an approach to generate 











samples of roots that correspond to uniform samples of coef- 

L 

ficients. To simplify the notations, we let A(z) = Xi 

t=i 

L 

denote the polynomial — ^'( 2 )) = 

e=i 

A polynomial is stable (also referred to as Schur stable) if 
all of its roots lie inside the unit circle. For a polynomial of 
order L, we use c < [L/2\ and r respectively to denote its 
number of pairs of complex roots and number of real roots. 
Let 

L 

Rl = {(Ai, • • • , Xl) I ^ X(,z^~^ is stable} c 

1=1 

be the coefficient space of all stable polynomials of degree 
L, and let G Rl correspond to the polynomials that 
have c pairs of complex roots. We call a conhguration 
of Rl with parameters {L,c). Clearly, Rl are bounded 
subspaces of R^ and Rl = R^l^ U • • • U 

In this section we propose an approach to generate uni¬ 
formly distributed polynomials using roots. We hrst present 
the following theorem, which helps us to find the relation 
between the distribution of coefficients of a polynomial and 
its roots. 


Theorem 1. The determinant of the Jacobian matrix of the 
coefficients of a polynomial with respect to its roots is the 
Vandermonde polynomial of the roots, i.e.. 


det 


( d[Xi,--- 


; Al] 
,aL] 


l<u<v<L 


Q-ii). 


(15) 


(c') 

Furthermore, the volume of i?} is 

Vol{Rf) = [ 

J (xi, 


n 


|(a„ - a„)' 


,ac^Xa-\-jyc 


c\{L - 2c)! 


dxidyi■ ■ ■ doL, 


(16) 


where Ci = {ix,y) \ x^ + y'^ < l},Si = {x \ —1 < x < 
1 }- 


Proof: Let J(ai, • • • , ol) be the value of the left-hand 
side of Equation (fTSl l. which is a polynomial of ai, • • • , ol. 
For any positive integers k, u and v, it is clear that 


dau 




Thus J changes sign under any transposition of the Ou 
and Ot, by properties of the determinant, i.e. J is an 
alternating polynomial of oi,--- ,az,. It implies that J is 
divisible by the Vandermonde polynomial V(ai, ■ ■ ■ , ul) = 
n (civ - a„) d]. Furthermore, both V(ai, - ■ ■ , a l) 

l<u<v<L 


and J(ai, • • • , 0 /,) are homogeneous polynomials of degree 
(L—l)L/2, and the coefficients of the term ’••02 

are both 1. Therefore, we obtain V(ai,--- jUl) = 
J (^ 15***5 CLLf 

In order to compute the volume of R ^^^, consider the space 
Cl X C R^. Each point (xi,yi,- ■ ■, a 2 c-i-i 5 * * * 5 «l) 

in R^^ corresponds to a set of roots (xi +jyi,xi —jyi, * * *, 
020 - 1-15 ■ ■ ■ ,cll) of a stable polynomial in r!'1\ and thus there 
is a 2“c!(L — 2c)! : 1 mapping from C} x to 

(due to the permutation of roots). Therefore, 


Vol(4'=)) = [ 


{xi,yi)eCi 


/a[Ai,A 25 * 

* 5 A2C-1-I5 * * * 5 Al] a 


* ,a2c-i-i5 * * * ,aL] J 


2^c!(L-2c)! 

dxidyi■■ ■ doL- 
(17) 


Since ak = Xk + jyk, fc = 1, • • • , c implies that 

det ( ^ 

VoFi5yi*** 5a2c-ri5*** 5 0^) J 

we obtain Equation (fTSl) by combining Equations (fT6]) - (fT7l) . 


Following Theorem [T] it is not difficult to obtain the 
following result. We omit the proof for brevity. 


Corollary 1. Generating a sample uniformly from Rl can 
be done via the following three-step procedure: 

1) Randomly draw 


c 


Multinomial 


yoKRf) 

VoI(Rl) ’ 


Vol{R^}^^'')\ _ 
VoI(Rl) j ’ 


2) Generate a random sample {xi,yi,- ■ ■ ,Xc,yc, ci 2 c+i, 

(c) 

***> ciL) from R)^ according to the (unnormalized) 
density 


\av-au\- (18) 

l<u<v<L 

ai—xi+jyiG" .ac—Xc-\-jyc 


3) Obtain (Ai, • • • ,Xl) by computing 


Afe = (- 1 )'" ^ 04 ••• 04 , fc = I,--- ,L. 

i<^i<'"<4<i 


In practical implementations, the parameters of multi¬ 
nomial distribution in the first step require only one-time 
computation, and the second step can be realized by a 
sequence of one-dimensional reject samplings. 


C. Calculating the Reference Curve 

Based on the discussions above, the following procedure 
describes how to derive the reference curve and Gap statis¬ 
tics. 

1) Generate F uniform random stable biters (8 = {*01, 
*02, * * * 5 ipp} with a given size L, using the technique 
introduced in Section IIII-BI 













2) Suppose that {I,-- - ,Mniax} is the candidate set of 
numbers of modes. For each M = 1,2,-■■ 
cluster ® into M disjoint clusters Ci, (72, • • • , Cm by 
minimizing; 


Wm — uiin 

■yi,--- ."TM 



M 


m— 1 


+ 

(19) 


where D{'^rn,'4>) has been dehned in Section ITll-Al and 
the clusters Cm are similarly dehned as in (|6llljFor this 
step, we hrst generate the matrix whose elements are 
pairwise distances between sampled hlters, and then 
run the A:-medoids algorithm lfT4l to approximate the 
optimum of ( [19] ). 

3) Plot the reference curve, which is log(WM) against M 
for M = 1, • • • , Mmax). We note that the reference 
curve is model independent. 

4) Plot the empirical curve given the MSPE for M — 
1,2,-•• jMmax, using the observed data, postulated 
model, and the model htting approach. For example, 
the postulated model in this paper is the mixture of AR 
introduced in Section |n| and the model htting approach 
is the EM algorithm. 

5) Finally, the number of AR mixtures that corresponds to 
the largest gap between the two curves is selected. 

Fig.di illustrates the sampled coefficients of stable hlters 
of size 2 randomly generated using the technique described 
in Section Illl-Bl The centers of the clusters (which are ap¬ 
proximated as some of the generated hlter samples) obtained 


* To make it consistent with the MSPE in Go), we put an additional “1” 
on the right hand side of definition GD compared with G). The “1” was 
used to represent the noise variance , since log(cr^) becomes a lineai* 
term in log Wm thus can be negligible. 
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Figure 1. The coefficients of 10000 second order stable filters (z^ + Xiz-\- 
A 2 ) that are independently generated from the uniform distribution on R 2 , 
and the centers for different number of clusters. 



Figure 2. Reference curves for L = 1, 2, 3, and 4 and Mmax = 6. 
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Estimated Number of AR Filters 
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0 
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97 
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0 

69 

28 

3 

0 
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eg 

BIC 

0 

91 

9 

0 

0 
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34 

OO 

Gap 

0 

0 

0 

7 

13 

34 

46 


Table I 

The estimated number oe AR filters for three different 

SCENARIOS USING AIC, BIC AND GAP STATISTICS (WITH THE TRUE 
NUMBER OF FILTERS FOR EACH SCENARIO HIGHLIGHTED) 


using A:-medoids algorithm are also shown in this figure for 
M = 1, 2,..., 6. These centers are calculated based on the 
average of 20 random instances, each with 1000 samples. 
Fig. |2| shows the reference curves for L = 1,2,3 and 4. 
Similar to Fig. [T| the reference curves are plotted based on 
20 random instances, each with 1000 samples. 

IV. Numerical Results 

We have generated zero-mean time-series with size N = 
1400 using the following three different models to evaluate 
the performance of the proposed model selection technique. 

Scenario 1: 4 AR hlters with lengths L = 2. The AR hlter 
at each time-instance is randomly drawn from a multinomial 
distribution with parameters (0.25, 0.25, 0.25, 0.25). 
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Scenario 2: 2 AR filters with lengths L = 4. The AR filter 
at each time-instance is randomly drawn from a Bernoulli 
distribution with parameters (0.4, 0.6). 

Scenario 3: 1 AR filters with lengths L = 1. The time- 
series is divided into 7 equal parts and only one AR filter 
is used to draw the data-points in each part. 

For each scenario, 100 time-series of length 1400 are 
generated using uniformly distributed random AR filters. For 
each time series, EM is run 50 times with different random 
initializations to increase the chance that the final estimates 
are close to the global optimum. Table |T] shows the estimated 
number of AR filters using AIC, BIC, and Gap statistics. The 
true number of filters for each scenario is highlighted. As it 
can be observed. Gap statistics outperforms AIC and BIC in 
all of the three scenarios. In the third scenario, AIC and BIC 
are not able to estimate the number of AR filters correctly, 
and severely underestimate the number of modes. While the 
Gap statistics finds the correct number of AR filters 46% of 
the time. 

For small L and large M, the uniformly generated AR 
filters are more likely to be close to one another, and 
thus identifying them as two separate modes becomes more 
challenging. Nevertheless, Gap statistics still outperforms 
AIC and BIC for this scenario. By increasing L, the volume 
of the space of stable filters Rl explodes and the average 
distance (defined in (fl^ l between two randomly chosen 
filters becomes larger, which makes the mode separation 
much easier for large L’s. In that case, the gain of our 
approach is more pronounced. 

V. Conclusions 

In this work, we introduced a new model selection 
technique based on Gap statistics in order to estimate the 
number of stable AR mixtures for modeling a given time 
series. The Gap statistics was extended to stable filters using 
a new distance measure between stable AR filters. This 
distance measure in turn was derived based on mean squared 
prediction error (MSPE). We also proposed a method to 
generate uniform random stable AR filters in the root 
domain, in order to compute the reference curve. This may 
be of some independent interest on its own right. Simulation 
results were provided demonstrating the performance of our 
proposed approach. 
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